NSF PAR Search | NSF Public Access Repository

NP-CAM: Efficient and Scalable DNA Classification using a NoC-Partitioned CAM Architecture

Morris, Benjamin F; Molom-Ochir, Tergel; Zhou, Changchun; Chen, Yiran; Jones, Alex; Li, Hai (February 2026, Proceedings of the 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA-32))

Free, publicly-accessible full text available February 9, 2027
AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

Yang, Zhuoping; Zhuang, Jinming; Chen, Xingzhen; Jones, Alex; Zhou, Peipei (November 2025, ACM)

GPUs are critical for compute-intensive applications, yet emerging workloads such as recommender systems, graph analytics, and data analytics often exceed GPU memory capacity. Existing solutions allow GPUs to use CPU DRAM or SSDs as external memory, and the GPU-centric approach enables GPU threads to directly issue NVMe requests, further avoiding CPU intervention. However, current GPU-centric approaches adopt synchronous I/O, forcing threads to stall during long communication delays. We propose AGILE, a lightweight asynchronous GPU-centric I/O library that eliminates deadlock risks and integrates a flexible HBM-based software cache. AGILE overlaps computation and I/O, improving performance by up to 1.88× across workloads with diverse computation-to-communication ratios. Compared to BaM on DLRM, AGILE achieves up to 1.75× speedup through efficient design and overlapping; on graph applications, AGILE reduces software cache overhead by up to 3.12× and NVMe I/O overhead by up to 2.85×; AGILE also lowers per-thread register usage by up to 1.32×.
Free, publicly-accessible full text available November 16, 2026
AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

https://doi.org/10.1145/3712285.3759778

Yang, Zhuoping; Zhuang, Jinming; Chen, Xingzhen; Jones, Alex; Zhou, Peipei (November 2025, ACM)

GPUs are critical for compute-intensive applications, yet emerging workloads such as recommender systems, graph analytics, and data analytics often exceed GPU memory capacity. Existing solutions allow GPUs to use CPU DRAM or SSDs as external memory, and the GPU-centric approach enables GPU threads to directly issue NVMe requests, further avoiding CPU intervention. However, current GPU-centric approaches adopt synchronous I/O, forcing threads to stall during long communication delays. We propose AGILE, a lightweight asynchronous GPU-centric I/O library that eliminates deadlock risks and integrates a flexible HBM-based software cache. AGILE overlaps computation and I/O, improving performance by up to 1.88× across workloads with diverse computation-to-communication ratios. Compared to BaM on DLRM, AGILE achieves up to 1.75× speedup through efficient design and overlapping; on graph applications, AGILE reduces software cache overhead by up to 3.12× and NVMe I/O overhead by up to 2.85×; AGILE also lowers per-thread register usage by up to 1.32×.
Free, publicly-accessible full text available November 15, 2026
DERCA: DetERministic Cycle-Level Accelerator on Reconfigurable Platforms in DNN-Enabled Real-Time Safety-Critical Systems

https://doi.org/10.1109/RTSS66672.2025.00039

Ji, Shixin; Yang, Zhuoping; Chen, Xingzhen; Zhang, Wei; Zhuang, Jinming; Jones, Alex K; Dong, Zheng; Zhou, Peipei (December 2025, IEEE)

Deep neural network (DNN) models are increasingly deployed in real-time, safety-critical systems such as autonomous vehicles, driving the need for specialized AI accelerators. However, most existing accelerators support only non-preemptive execution or limited preemptive scheduling at the coarse granularity of DNN layers. This restriction leads to frequent priority inversion due to the scarcity of preemption points, resulting in unpredictable execution behavior and, ultimately, system failure. To address these limitations and improve the real-time performance of AI accelerators, we propose DERCA, a novel accelerator architecture that supports fine-grained, intra-layer flexible preemptive scheduling with cycle-level determinism. DERCA incorporates an on-chip Earliest Deadline First (EDF) scheduler to reduce both scheduling latency and variance, along with a customized dataflow design that enables intra-layer preemption points (PPs) while minimizing the overhead associated with preemption. Leveraging the limited preemptive task model, we perform a comprehensive predictability analysis of DERCA, enabling formal schedulability analysis and optimized placement of preemption points within the constraints of limited preemptive scheduling. We implement DERCA on the AMD ACAP VCK190 reconfigurable platform. Experimental results show that DERCA outperforms state-of-the-art designs using non-preemptive and layer-wise preemptive dataflows, with less than 5% overhead in worst-case execution time (WCET) and only 6% additional resource utilization. DERCA is open-sourced on GitHub: https://github.com/arc-research-lab/DERCA
Free, publicly-accessible full text available December 2, 2026
ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems

https://doi.org/10.1145/3716368.3735215

Ji, Shixin; Chen, Xingzhen; Zhuang, Jinming; Zhang, Wei; Yang, Zhuoping; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex; Dong, Zheng; et al (June 2025, ACM)

Real-time systems are widely applied in areas such as autonomous vehicles, where safety is the key metric. However, on the FPGA platform, most prior accelerator frameworks omit any discussion of schedulability in such real-time safety-critical systems, leaving deadlines unmet, which can lead to catastrophic system failures. To address this, we propose the ART framework, a hardware-software co-design approach that transforms baseline accelerators into “real-time guaranteed” accelerators. On the software side, ART performs schedulability analysis and preemption point placement, optimizing task scheduling to meet deadlines and enhance throughput. On the hardware side, ART integrates the Global Earliest Deadline First (GEDF) scheduling algorithm, implements preemption, and conducts source code transformation to turn baseline HLS-based accelerators into designs targeted for real-time systems and capable of saving and resuming tasks. ART also includes integration, debugging, and testing tools for full-system implementation. We demonstrate the methodology of ART on two kinds of popular accelerator models and evaluate on the AMD Versal VCK190 platform, where ART meets schedulability requirements that baseline accelerators fail to meet. ART is lightweight, utilizing <0.5% resources. With about 100 lines of user input, ART generates about 2.5k lines of accelerator code, making it a push-button solution.
Free, publicly-accessible full text available June 29, 2026
MTrain: Enable Efficient CNN Training on Heterogeneous FPGA-Based Edge Servers

https://doi.org/10.1109/TCAD.2025.3541486

Tang, Yue; Jones, Alex K; Xiong, Jinjun; Zhou, Peipei; Hu, Jingtong (January 2025, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, etc. Equipped with heterogeneous FPGA-based accelerator cards, the servers can run multiple tasks, including efficient video preprocessing, machine learning algorithm acceleration, etc. These servers are required to perform inference during the daytime while re-training the model during the night to adapt to new environments, domains, or new users. During re-training, conventionally, the incoming data are transmitted to the cloud, and the updated machine learning models are then transferred back to the edge server. Such a process is inefficient and cannot protect users’ privacy, so it is desirable for the models to be trained directly on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous resource-constrained FPGAs is challenging since it needs to consider both the complex data dependency of the training process and the communication bottleneck among different FPGAs. Previous multi-accelerator training algorithms select optimal scheduling strategies for data parallelism, tensor parallelism, and pipeline parallelism. However, pipeline parallelism cannot deal with batch normalization (BN), which is an essential CNN operator, while purely applying data parallelism and tensor parallelism suffers from resource under-utilization and intensive communication costs. In this work, we propose MTrain, a novel multi-accelerator training scheduling strategy that transforms the training process into a multi-branch workflow, so that independent sub-operations of different branches are executed in parallel on different training accelerators for better utilization and reduced communication overhead. Experimental results show that we can achieve efficient CNN training on heterogeneous FPGA-based edge servers with 1.07x-2.21x speedup under 15 GB/s peer-to-peer bandwidth compared to the state-of-the-art work.
Full Text Available
Amortizing Embodied Carbon Across Generations

Ji, Shixin; Zhuang, Jinming; Yang, Zhuoping; Jones, Alex; Zhou, Peipei (November 2024, IEEE)

Data centers have been relying on renewable energy integration coupled with energy efficient specialized processing units and accelerators to increase sustainability. Unfortunately, the carbon generated from manufacturing these systems is becoming increasingly relevant due to these energy decarbonization and efficiency improvements. Furthermore, it is less clear how to mitigate this aspect of embodied carbon. As workloads continue to evolve over each hardware generation, we explore the tradeoffs of fabricating new application-tuned hardware compared with more general solutions such as Field Programmable Gate Arrays (FPGAs). We also explore how REFRESH FPGAs can amortize embodied carbon investments from previous generations to meet the requirements of future generations' workloads.
Full Text Available
Towards Accelerator Customization in Real-time Safety-critical Systems

https://doi.org/10.1145/3706628.3708841

Ji, Shixin; Chen, Xingzhen; Zhang, Wei; Yang, Zhuoping; Zhuang, Jinming; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex K; Dong, Zheng; et al (February 2025, ACM)

Free, publicly-accessible full text available February 27, 2026
SPIMulator: A Spintronic Processing-in-memory Simulator for Racetracks

Bera, Pavia; Cahoon, Stephen; Bhanja, Sanjukta; Jones, Alex (September 2024, ACM Transactions on Embedded Computing Systems)

Full Text Available
Reducing Smart Phone Environmental Footprints with In-Memory Processing

Yang, Zhuoping; Zhang, Wei; Ji, Shixin; Zhou, Peipei; Jones, Alex (October 2024, IEEE)

Smart phones have revolutionized the availability of computing to the consumer. Recently, smart phones have been aggressively integrating artificial intelligence (AI) capabilities into their devices. The custom designed processors for the latest phones integrate incredibly capable and energy efficient graphics processors (GPUs) and tensor processors (TPUs) to accommodate this emerging AI workload and on-device inference. Unfortunately, smart phones are far from sustainable and have a substantial carbon footprint that continues to be dominated by environmental impacts from their manufacture and far less so by the energy required to power their operation. In this paper we explore the possibility of reversing the trend to increase the silicon dedicated to emerging application workloads in the phone. Instead we consider how in-memory processing using the DRAM already present in the phone could be used in place of dedicated GPU/TPU devices for AI inference. We explore the potential savings in embodied carbon that could be possible with this tradeoff and provide some analysis of the potential of in-memory computing to compete with these accelerators. While it may not be possible to achieve the same throughput, we suggest that the responsiveness to the user may be sufficient using in-memory computing, while both the embodied and operational carbon footprints could be improved. Our approach can save circa 10–15 kg CO2e.
Full Text Available
